Introduction to Open Data Science - Course Project

About the project

This will be my (i.e. Baran’s) project page for the assignments of the course PHD-302 Introduction to Open Data Science, University of Helsinki.

Assignment 1:

Here is my GitHub repository: https://github.com/bbayraktaroglu/IODS-project

date()
## [1] "Tue Dec 13 06:28:44 2022"

Some thoughts about the course:

I had never used R in my previous programming courses, so the start was a little overwhelming for me. This is also officially my first time using GitHub: I tried to use it before for some personal projects, but failed to understand its general use. I recently heard about the course through the DONASCI mailing list, and earlier from a friend of mine who also suggested that I take it. I expect to learn as much as I can about data science at an advanced level. The last time I took a rigorous statistics course (or any programming language course) was during my bachelor studies.

Thoughts about Exercise set 1 and the R for Health Data Science book:

It seems that the exercise set and the first few chapters of the book provide the basic essentials for learning R as a programming language, with its own quirks and conventions. I found the pipe operator, i.e. “%>%”, the most unusual way of passing an input to a function. It is sometimes hard to understand why it is used instead of the ordinary way of calling a function. Other than this, R seems to be an intuitive language, with easy-to-understand commands.
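To make the comparison concrete, here is a minimal sketch of the same computation written as an ordinary nested call and as a pipe. The vector `x` and the names `nested` and `piped` are made up for this illustration, and the sketch assumes the magrittr package (which provides `%>%` and is also attached by dplyr) is installed.

```r
# Requires the magrittr package, which defines the %>% operator.
library(magrittr)

x <- c(1, 4, 9)

# Ordinary nested call: read from the inside out.
nested <- sum(sqrt(x))

# Piped call: read from left to right; the left-hand side becomes
# the first argument of the function on the right-hand side.
piped <- x %>% sqrt() %>% sum()

# Both forms compute the same value.
identical(nested, piped)
```

The pipe does not change what is computed. It only changes the order in which the steps are written, which tends to help when many steps are chained.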

R Markdown seems to be a very neat notebook format, similar to LaTeX compilers or Jupyter notebooks. I found it easy to understand, but it will take time to get used to its syntax.


Assignment 2: Linear Regression

This week I have worked on linear regression. To be honest, the last time I studied this subject was almost 8 years ago during my bachelor studies, and although the subject is quite easy, I still find some parts quite fascinating. I had never worked with R before, so this week was more of an introduction to hands-on R experience compared to last week’s assignment. R syntax seems intuitive compared to less user-friendly languages like C or even Java.

date()
## [1] "Tue Dec 13 06:28:44 2022"

2.1 Data wrangling

The task for data wrangling seemed daunting at first, but the individual steps were already built from the ground up in the exercise set, so I did not run into any trouble.

2.2 Analysis

Setting up the packages

library(tidyverse)
library(dplyr)
library(ggplot2) 
library(GGally)            
library(purrr)

2.2.1: Reading the dataset

# reading the required file for the assignment
students2014 <- read.csv("learning2014.csv", sep = ",", header = TRUE)

We now compute the dimensions of the data and look at its structure:

dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Description of the dataset:

There are 166 observations (each representing a student) and 7 variables in this dataset. The data as a whole was collected as a survey between 2014 and 2015. The variables which are selected for this assignment try to keep track of what type of pedagogical learning method the students used, their overall attitude towards statistics, together with information about their gender, age and exam points. Here is a table with the definitions of the variables:

| Variable | Variable type | Definition |
|---|---|---|
| gender | Character | gender of the student, M (male) / F (female) |
| age | Integer | age of the student |
| attitude | Numeric (double) | average of the student’s overall attitude toward statistics, scale 1-5 |
| deep | Numeric (double) | deep learning metric, scale 1-5 |
| stra | Numeric (double) | strategic learning metric, scale 1-5 |
| surf | Numeric (double) | surface learning metric, scale 1-5 |
| points | Integer | exam points of the student (observed range 7-33) |

2.2.2: Graphical overview and summaries

We draw a graphical overview of the dataset:

ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

We also show summaries of the variables:

summary(students2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Comments about the outputs

From the graphical overview one can see the scatter plots, the correlations, and the distributions of the variables, both individually and in pairs. From the summary, one can read off the minimum, maximum, mean and quartiles of each variable. Females are colored red, while males are colored blue.

We see that there are considerably more females than males in the study. Females seem to be much younger than the average male, and the females’ attitude towards statistics seems to be considerably lower than that of their male counterparts. There seems to be a strong positive correlation between attitude and exam points for both genders. Interestingly, there is a strong negative correlation between attitude and surface learning for males, while no clear conclusion can be drawn for females. The same holds for the correlation between surface and deep learning. The negative correlation means that male students who prefer surface learning are more likely to have a negative attitude towards statistics.

2.2.3: Model fitting

We choose the variables attitude, strategic learning and surface learning as explanatory variables, and construct a linear regression for the dependent variable “exam points”.

my_model <- lm(points ~ attitude + stra +surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Comments about the result:

One can see that the adjusted R-squared value is 0.1927, which means that attitude, strategic learning and surface learning together explain about 19.27% of the variance in the exam points (after adjusting for the number of explanatory variables). Moreover, attitude is highly significant with a p-value of about \(1.93 \times 10^{-8}\), far below the strictest conventional threshold of 0.001. Unfortunately, the other variables are not significant, with p-values above 0.1, so there is no evidence that strategic learning and surface learning have as much explanatory power as attitude. The model has an overall p-value of \(3.156 \times 10^{-8}\), which is very low, so the model is significant overall.
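As a quick check of how the adjusted R-squared relates to the multiple R-squared, the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\) can be evaluated by hand. This is only a sketch using the rounded numbers printed in the summary above; the variable names `r2`, `n`, `p` and `adj_r2` are chosen for this illustration.

```r
r2 <- 0.2074   # multiple R-squared, as printed by summary(my_model)
n  <- 166      # number of observations
p  <- 3        # number of explanatory variables (attitude, stra, surf)

# adjusted R-squared penalizes the fit for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(adj_r2, 4)
```

The result agrees with the adjusted R-squared of 0.1927 reported by `summary()`, up to the rounding of the input.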

Another model

We now remove the variables stra and surf, since neither is statistically significant, and fit a new model:

my_model2 <- lm(points ~ attitude , data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Comments about the new result:

As expected, the p-value of the remaining explanatory variable “attitude” decreased, to about \(4.12 \times 10^{-9}\). The overall p-value is the same as the one for attitude, since with a single explanatory variable the overall F-test and the t-test for that variable are equivalent. Thus, compared to the previous model, every explanatory variable in the model is now clearly significant. But the adjusted R-squared value is now 0.1856, which is slightly lower than the previous 0.1927. This means that the model has lost a little explanatory power, and now explains about 18.56% of the variance in the exam points. This is expected: removing variables from a model cannot increase the multiple R-squared, and here the decrease is small. With only one explanatory variable, the multiple R-squared (0.1906) is simply the squared correlation between attitude and points.
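The equality of the two p-values can be checked from the printed numbers alone: in a regression with one explanatory variable, the F-statistic is the square of the t-statistic of the slope. The sketch below uses the rounded values from the summary of `my_model2`; the names `t_attitude`, `f_stat` and `p_overall` are introduced here only for illustration.

```r
t_attitude <- 6.214          # t value of attitude in summary(my_model2)

# with one explanatory variable, F = t^2
f_stat <- t_attitude^2
round(f_stat, 2)             # compare with the printed F-statistic, 38.61

# p-value of the overall F-test on 1 and 164 degrees of freedom
p_overall <- pf(38.61, df1 = 1, df2 = 164, lower.tail = FALSE)
p_overall
```

Up to rounding, `p_overall` should match both the overall p-value and the p-value of attitude in the summary.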

2.2.4: Diagnostics

# arrange the following three graphics in a 2 x 2 grid
par(mfrow = c(2,2))
# draw diagnostic plots for the final model
plot(my_model2, which = c(1,2,5))

Final comments about the diagnostics:

The final model seems to meet our expectations. The points in the Q-Q plot lie mostly along the line, which means that the residuals are approximately normally distributed. The Residuals vs Fitted plot shows that most of the points are scattered in a horizontal strip around the line residual = 0 with no visible pattern, so the assumptions of linearity and constant variance are reasonably well supported. There are no obvious outliers. Finally, the Residuals vs Leverage plot tells us that two data points (namely 56 and 35) lie close to the Cook’s distance contour, but they do not fall outside of it. Thus none of the data points has an unduly large influence on the regression model, although the data points 56 and 35 could be analyzed further just to be sure.
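The influence of individual points can also be inspected numerically with `cooks.distance()`. The following is a minimal sketch on R's built-in `cars` dataset rather than on `students2014`, so that it runs on its own; the cut-off `4 / n` is a common rule of thumb and an assumption here, not something taken from the course material.

```r
# fit a simple linear regression on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# rule-of-thumb threshold (an assumption; other cut-offs are also in use)
threshold <- 4 / nrow(cars)

# the three most influential observations
head(sort(cd, decreasing = TRUE), 3)

# indices of observations above the threshold
which(cd > threshold)
```

Applied to `my_model2`, the same two calls would list the observations that deserve a closer look, such as 56 and 35.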


Assignment 3: Logistic Regression

This week I have worked on logistic regression. Slowly but surely, I am starting to feel comfortable with R and R Markdown. I hope next week is going to be even easier for me. Based on some peer reviews that I received last week, I overhauled my course diary. Now it should look nicer.

date()
## [1] "Tue Dec 13 06:29:00 2022"

3.1: Data wrangling

This week, data wrangling felt considerably easier. I followed the tasks and used some help from Exercise 3. The R code of the data wrangling part is in the data folder of my GitHub repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/create_alc.R

3.2 Analysis

Setting up the packages

library(tidyverse)
library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(boot)
library(GGally)            
library(purrr)
library(gmodels)
library(knitr)
library(patchwork)
library(finalfit)
library(stringr)
library(caTools) 
library(caret)

3.2.1: Reading the dataset

# set the working directory
setwd("/Users/barancik/Github/IODS-project/data")
# reading the required file for the assignment
alc <- read.csv("alc.csv", sep = ",", header = TRUE)

We now compute the dimensions of the data and look at its structure:

dim(alc)
## [1] 370  35
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Description of the dataset:

There are 370 observations (each representing a student) and 35 variables in this dataset. The data as a whole was collected as a survey on 27.11.2014, from two different Portuguese schools. The data consists of measurements regarding success of the students in two different subjects: Mathematics and Portuguese language. The variables in this assignment try to keep track of some background information about the student, like age, sex, etc., and important measures regarding the success of the students such as number of past class failures, number of school absences, current health status, alcohol consumption, etc. Some of the variables are binary like the sex or internet access, some are numeric, and some are nominal answers like ‘mother’s job’. Some numeric ones are between 1-5, while some are not bounded. The grades (G1, G2, G3) are between 0-20, and each represent grades obtained in different periods.

The datasets for mathematics and Portuguese language are combined by averaging the numeric variables, including the grade variables. We combined them into a single dataset which only includes the students who took both courses. The variable ‘alc_use’ is the average of workday alcohol consumption (‘Dalc’) and weekend alcohol consumption (‘Walc’). ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
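The two derived variables can be illustrated on a toy data frame. The sketch below uses the first six values of `Dalc` and `Walc` shown in the `str()` output above; the data frame name `toy` is made up, and the actual wrangling script may construct the columns differently.

```r
# first six workday / weekend consumption values from str(alc)
toy <- data.frame(Dalc = c(1, 1, 2, 1, 1, 1),
                  Walc = c(1, 1, 3, 1, 2, 2))

# average of workday and weekend consumption
toy$alc_use <- (toy$Dalc + toy$Walc) / 2

# TRUE when the average consumption is higher than 2
toy$high_use <- toy$alc_use > 2

toy
```

The resulting `alc_use` and `high_use` columns agree with the first six entries listed in the structure of `alc`.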

3.2.2: Hypotheses

Our main aim is to understand the relationship between alcohol consumption and other variables in the data. We choose the variables ‘failures’, ‘absences’, ‘sex’ and ‘famrel’. We hypothesize that there is a correlation between ‘high_use’ and ‘failures’ (number of past failures), ‘absences’ (number of school absences) and ‘famrel’ (quality of family relations). We also hypothesize that there is a correlation between being a male and high consumption of alcohol.

High alcohol consumption should in principle be correlated with the number of past failures, since a student with a serious alcohol problem is likely to fail more often.

Similarly, high alcohol consumption is expected to be correlated with a high number of absences, since if the student is intoxicated most of the time, attending class becomes difficult if not impossible.

For family relations, we expect bad family relations to be correlated with high alcohol consumption, since students may try to escape from troublesome relations at home, and alcohol is one such escape.

Finally, we expect higher alcohol consumption from male students, but we accept that this could be read as a sexist expectation.

3.2.3 Plots

We now draw some plots regarding high alcohol usage versus the hypothesized variables above:

# put the hypothesized  variables in new data frame
keep_columns <- c("high_use", "failures", "absences", "famrel", "sex")
# all_of() is the current tidyselect helper; one_of() is superseded
alc_hypo <- select(alc, all_of(keep_columns))

Let’s now draw a scatter plot matrix to first summarize everything:

ggpairs(alc_hypo, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

Now let’s start with a bar plot between high alcohol consumption and sex:

# initialize a plot of 'high_use'
g1 <- ggplot(data = alc, aes(x = high_use))

# draw a bar plot of high_use by sex
g1 + geom_bar()+facet_wrap("sex")

We can see that there may well be some correlation between being male and having high alcohol consumption. The number of females with high alcohol consumption is very small compared to the number of females without, and this ratio is higher for males.

We now construct a bar plot of each variable:

# initialize a plot of 'high_use'
g2 <- ggplot(data = alc, aes(x = high_use))

# draw a bar plot of high_use by failures
g2 + geom_bar()+facet_wrap("failures")

We see that, similar to sex, the ratio of high to low alcohol consumption increases as the number of past failures increases. So there could be some correlation.

# initialize a plot of 'high_use'
g3 <- ggplot(data = alc, aes(x = high_use))

# draw a bar plot of high_use by absences
g3 + geom_bar()+facet_wrap("absences")

We see that, similar to ‘sex’ and ‘failures’, the ratio of high to low alcohol consumption increases as the number of absences increases. So there could again be some correlation.

# initialize a plot of 'high_use'
g3 <- ggplot(data = alc, aes(x = high_use))

# draw a bar plot of high_use by family relations
g3 + geom_bar()+facet_wrap("famrel")

Finally, for family relations we again have a similar situation, but it is a little more complicated. Overall, it seems again that the ratio of high to low alcohol consumption increases as the family relations get worse.

We also draw a bar plot which includes all of our explanatory variables, together with the dependent variable:

# draw a bar plot of each variable
gather(alc_hypo) %>% ggplot(aes(value)) + geom_bar()+ facet_wrap("key", scales = "free")

Finally, a boxplot of family relations and absences by alcohol consumption and sex:

# initialize a plot of high_use and family relations
g1 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relations")+ggtitle("Student family relations by alcohol consumption and sex")

# initialize a plot of high_use and absences
g2<- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a box plot and draw it
g2 + geom_boxplot() + ylab("absences") +ggtitle("Student absences by alcohol consumption and sex")

Overall observations

We see that there could be some correlation between the hypothesized explanatory variables (failures, absences, sex, family relations) and the dependent variable (high alcohol consumption). We will further analyze this.

3.2.4 Using logistic regression

We now move on to a more statistical way of testing whether our hypotheses are supported by the data. We will use logistic regression to accomplish this:

# find the model with glm()
m <- glm(high_use ~ failures + absences + sex + famrel, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + famrel, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0786  -0.8216  -0.5746   0.9760   2.1820  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.79406    0.54281  -1.463  0.14350    
## failures     0.57328    0.20531   2.792  0.00523 ** 
## absences     0.08941    0.02274   3.932 8.43e-05 ***
## sexM         1.04800    0.25091   4.177 2.96e-05 ***
## famrel      -0.29791    0.13044  -2.284  0.02238 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 401.77  on 365  degrees of freedom
## AIC: 411.77
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)    failures    absences        sexM      famrel 
## -0.79406437  0.57327802  0.08940969  1.04800182 -0.29791173
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI<- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.4520039 0.1532656 1.2982302
## failures    1.7740730 1.1936470 2.6881229
## absences    1.0935286 1.0480739 1.1462668
## sexM        2.8519467 1.7556470 4.7047758
## famrel      0.7423669 0.5735490 0.9583804

We now interpret the summary. Observe that absences and sex (male) have a highly significant association with high_use (positive, since the coefficients are positive), with p-values less than 0.001. Failures has a significant positive association with high_use, with a p-value between 0.001 and 0.01. Finally, family relations has a significant association with high_use (negative, since the coefficient is negative), with a p-value between 0.01 and 0.05. All of our hypotheses are therefore supported at the 0.05 level. If one requires a p-value below 0.01, then family relations is no longer significantly associated with high_use.

We now interpret the coefficients as odds ratios. Note that the exponential of a coefficient of a logistic regression model can be interpreted as the odds ratio for a unit change (vs. no change) in the corresponding explanatory variable, with the other variables held fixed:

  1. We see that the odds ratio of failures is about 1.77. This means that each additional failure multiplies the odds of a student having high alcohol consumption by about 1.77. Thus, more failures mean higher odds of having high alcohol consumption, as hypothesized earlier.

  2. Similarly, each additional absence multiplies the odds of a student having high alcohol consumption by about 1.09. This is close to 1, so a single absence changes the odds only slightly, but the confidence interval lies entirely above 1, so it is in line with our hypothesis.

  3. The odds ratio for sex (male) is about 2.85, which indicates that being male rather than female changes the odds of having high alcohol consumption the most. This is also in line with our hypothesis, since we said that being male should be positively correlated with high alcohol consumption.

  4. Finally, the odds ratio of family relations is less than 1 (about 0.74), which means that the odds of having high alcohol consumption decrease as family relations improve. This is also in line with our hypothesis: better family relations go together with lower alcohol consumption.
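The odds ratios discussed above can be recomputed directly from the printed coefficients. The sketch below copies the estimates from the `coef(m)` output; the names `b`, `or`, `pct_change` and `or_5_absences` are introduced only for this illustration.

```r
# coefficients copied from coef(m) above (intercept omitted)
b <- c(failures = 0.57327802,
       absences = 0.08940969,
       sexM     = 1.04800182,
       famrel   = -0.29791173)

# odds ratio for a one-unit increase in each variable
or <- exp(b)
round(or, 2)

# the same information as a percentage change in the odds
pct_change <- (or - 1) * 100
round(pct_change, 1)

# odds ratios multiply: effect of five additional absences
or_5_absences <- exp(5 * b[["absences"]])
or_5_absences
```

The last line shows why an odds ratio close to 1 can still matter: the effect of one absence is small, but it compounds over several absences.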

3.2.5 Predictive power of the model

We compute the predictive power of the model with failures, absences, sex and family relations as our explanatory variables and high_use as the dependent variable. We kept all of the initially chosen explanatory variables, since in the last section we found a significant association between each of them and high_use.

# fit the model
m <- glm(high_use ~ failures + absences + sex + famrel, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, famrel, high_use, probability, prediction) %>% tail(10)
##     failures absences sex famrel high_use probability prediction
## 361        0        3   M      4    FALSE   0.3386132      FALSE
## 362        1        0   M      4    FALSE   0.4098873      FALSE
## 363        1        7   M      5     TRUE   0.4908822      FALSE
## 364        0        1   F      5    FALSE   0.1002713      FALSE
## 365        0        6   F      4    FALSE   0.1901165      FALSE
## 366        1        2   F      5    FALSE   0.1777706      FALSE
## 367        0        2   F      4    FALSE   0.1410142      FALSE
## 368        0        3   F      1    FALSE   0.3049689      FALSE
## 369        0        4   M      2     TRUE   0.5039381       TRUE
## 370        0        2   M      4     TRUE   0.3188873      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   15
##    TRUE     77   34
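The predicted probabilities listed above can be reproduced by hand from the coefficients, which may help in understanding what `predict(m, type = "response")` returns. The sketch below does this for student 369 (0 failures, 4 absences, male, famrel = 2); the names `b`, `x`, `eta` and `prob` are made up for this illustration.

```r
# coefficients copied from coef(m)
b <- c(intercept = -0.79406437,
       failures  =  0.57327802,
       absences  =  0.08940969,
       sexM      =  1.04800182,
       famrel    = -0.29791173)

# student 369: intercept term, 0 failures, 4 absences, male (1), famrel 2
x <- c(1, 0, 4, 1, 2)

# linear predictor (log-odds)
eta <- sum(b * x)

# the inverse logit turns log-odds into a probability
prob <- plogis(eta)
prob

# classification with the 0.5 threshold used above
prob > 0.5
```

The probability agrees with the value 0.5039381 in the listing, which is why this student is the only one of the last ten predicted to have high use.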

We see that our model correctly predicts 244 false and 34 true observations. The rest are inaccurately classified individuals. We can graph the actual values vs predictions:

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>%addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65945946 0.04054054 0.70000000
##    TRUE  0.20810811 0.09189189 0.30000000
##    Sum   0.86756757 0.13243243 1.00000000

We now compute the proportion of inaccurately classified individuals:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486

We find a value of about 0.25. This means that on average, 1 out of 4 students is inaccurately classified: either predicted to be a heavy drinker while actually being a light drinker, or vice versa.
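The same error rate can be read off the confusion matrix shown earlier, together with a few related quantities. The sketch below rebuilds the table from the printed counts; the names `conf`, `error_rate`, `sensitivity`, `specificity` and `baseline_error` are introduced here for illustration, and the "always predict FALSE" baseline is an extra comparison that is not part of the exercise.

```r
# confusion matrix rebuilt from the printed counts (filled column by column)
conf <- matrix(c(244, 77, 15, 34), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

# proportion of wrong predictions: off-diagonal counts over the total
error_rate <- (conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]) / sum(conf)
error_rate

# share of actual high users that the model detects
sensitivity <- conf["TRUE", "TRUE"] / sum(conf["TRUE", ])

# share of actual low users that the model classifies correctly
specificity <- conf["FALSE", "FALSE"] / sum(conf["FALSE", ])

# error of a trivial classifier that always predicts FALSE
baseline_error <- sum(conf["TRUE", ]) / sum(conf)

c(sensitivity = sensitivity, specificity = specificity,
  baseline_error = baseline_error)
```

The error rate matches the output of `loss_func`. The comparison with the baseline suggests that the model improves on always predicting low use, while its sensitivity shows that most high users are still missed at the 0.5 threshold.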

3.2.6: Bonus

We perform 10-fold cross validation:

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2567568

We obtain a value of about 0.26, which is about the same as, if not slightly worse than, the model in the Exercise. Thus the test performance is almost identical. A possible explanation is that family relations has a small impact on the dependent variable compared to sex or failures, so including it did not create a better model and may in fact have worsened it slightly.
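To see what 10-fold cross-validation does behind the scenes, the procedure can be written out by hand. The sketch below is not the implementation of `cv.glm()`; it uses simulated data (the data frame `toy` and all other names are made up) and the same misclassification loss as `loss_func`.

```r
set.seed(1)

# simulated data: one predictor and a logical response
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(toy$x)) == 1

# assign every observation to one of 10 folds at random
k     <- 10
folds <- sample(rep(1:k, length.out = n))

# for each fold: fit on the other nine folds, predict the held-out fold
fold_errors <- sapply(1:k, function(i) {
  train <- toy[folds != i, ]
  test  <- toy[folds == i, ]
  fit   <- glm(y ~ x, data = train, family = "binomial")
  prob  <- predict(fit, newdata = test, type = "response")
  # proportion of held-out observations that are misclassified
  mean(abs(test$y - prob) > 0.5)
})

# cross-validated error: average over the folds
cv_error <- mean(fold_errors)
cv_error
```

Because every observation is predicted by a model that never saw it, the cross-validated error is usually a little higher than the training error, which is consistent with 0.26 versus 0.25 above.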


Assignment 4: Clustering and classification

This week I have worked on clustering and classification. This week definitely felt much easier for me.

date()
## [1] "Tue Dec 13 06:29:19 2022"

4.1: Data wrangling

This week, data wrangling felt even easier than last week. I mostly used create_alc.R as a reference. The R code of the data wrangling part is in the data folder of my GitHub repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/create_human.R

4.2 Analysis

Setting up the packages

library(MASS)
library(dplyr)
library(tidyr)
library(tidyverse)
library(corrplot)
library(ggplot2)
library(plotly)

4.2.1: Reading the dataset

# reading the required file for the assignment
data("Boston")
# checking out its dimension, structure and summary
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The Boston dataset has 506 observations and 14 variables. It is included in the MASS package of R. The data frame contains data related to housing values in the suburbs of Boston. Most of the variables are numeric (double), while the “chas” and “rad” variables are integers.

4.2.2 Plots

Let’s put our newly learned knowledge about correlation plots to good use. The following is the correlation matrix and its various plots of the Boston data:

# calculating the correlation matrix, also round it to 2 digits
cor_matrix <- cor(Boston) %>% round(digits=2)

# print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method="circle")

corrplot(cor_matrix, method="number")

Observe that most of the variables are more or less correlated with each other, but the “chas” variable has correlations very close to 0 with all of the other variables (at most 0.18 in absolute value, with “medv”). We know from basic probability theory that being uncorrelated does not imply independence, so we cannot infer that “chas” is independent of the other variables. We can only say that it is almost uncorrelated with them. “rad” and “indus” are moderately to strongly correlated (in absolute value) with most of the other variables (except “chas”). “rad” has 0.91 correlation with “tax” and 0.60 with “indus”, while “tax” has 0.72 with “indus”. “indus” has -0.71 correlation (strong negative correlation) with “dis”, while “nox” has -0.77 correlation with “dis”.
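
Reading the strongest pairs off a 14 x 14 matrix by eye is error-prone, so here is a small sketch that lists them programmatically. The cut-off of 0.7 and the helper names (upper, pairs_df, strong, chas_max) are my own choices for illustration, not part of the exercise.

```r
library(MASS)  # provides the Boston data frame

# correlation matrix rounded to 2 digits, as in the text
cor_matrix <- round(cor(Boston), digits = 2)

# keep each pair once: blank out the diagonal and the lower triangle
upper <- cor_matrix
upper[lower.tri(upper, diag = TRUE)] <- NA

# long format: one row per variable pair
pairs_df <- as.data.frame(as.table(upper), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "r")
pairs_df <- pairs_df[!is.na(pairs_df$r), ]

# pairs with |r| >= 0.7, strongest first (the 0.7 cut-off is arbitrary)
strong <- pairs_df[abs(pairs_df$r) >= 0.7, ]
strong <- strong[order(-abs(strong$r)), ]
print(strong)

# largest absolute correlation of "chas" with any other variable
chas_max <- max(abs(cor_matrix["chas", colnames(cor_matrix) != "chas"]))
chas_max
```

The first row of strong should be the “rad” and “tax” pair, and chas_max gives a single number for how weakly “chas” is related to everything else.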

4.2.3 Standardize the dataset and print scaled data summary

We will scale the data by subtracting the column means from the corresponding columns and dividing the difference by the column standard deviation. This standardizes each variable to have mean 0 and standard deviation 1.
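
As a quick check of what scale() does, the following sketch standardizes one column by hand and compares it with the scale() result. The names z_manual and z_scale are only for illustration.

```r
library(MASS)  # provides the Boston data frame

# manual standardization of one column: (x - mean) / sd
x <- Boston$crim
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same for every column at once
z_scale <- scale(Boston)[, "crim"]

# the two agree up to floating point error
all.equal(z_manual, as.numeric(z_scale))

# after scaling: mean is (numerically) 0 and standard deviation is 1
c(mean = mean(z_manual), sd = sd(z_manual))
```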

# scaling the Boston
boston_scaled <- as.data.frame(scale(Boston))

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We now create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). We will use the quantiles as the break points in the categorical variable.

We will then drop the old crime rate variable from the dataset. Afterwards, we divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

# creating a categorical variable called "crime" from scaled crime rate
boston_scaled$crim <- as.numeric(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- boston_scaled %>% dplyr::select(-crim)

# add the new categorical variable to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset
n <- nrow(boston_scaled) 

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# creating the train set
train <- boston_scaled[ind,]

# creating the test set 
test <- boston_scaled[-ind,]
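
A small sanity check on the binning and the split can be sketched as follows. Two things here are my own assumptions and not part of the code above: the seed value 123, and calling set.seed() before sample() so that the same rows end up in the train set on every run.

```r
library(MASS)  # provides the Boston data frame

boston_scaled <- as.data.frame(scale(Boston))

# quartile break points of the scaled crime rate
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# quartile bins should hold roughly a quarter of the observations each
table(crime)

# fixing the seed BEFORE sampling makes the split repeatable
# (the seed value 123 is an arbitrary choice for this sketch)
set.seed(123)
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))

# sizes of the resulting train and test sets
c(train = length(ind), test = n - length(ind))
```

Without a seed in front of sample(), every knit produces a different train/test split, and with it a different confusion matrix later on.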

4.2.4 Fit the LDA and draw its (bi)plot

We will now fit the linear discriminant analysis on the train set. We will use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. We then draw the LDA (bi)plot.

# linear discriminant analysis
lda.fit <- lda(crime ~ . , data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Observe that in the LDA biplot the arrow of “rad” is by far the longest, so “rad” is the most influential linear discriminator. It points towards the cluster that consists mostly of “high” observations.

4.2.5 LDA prediction

set.seed(123)

# saving the correct classes from test data
correct_classes <-test$crime

# removing the crime variable from test data
test <- dplyr::select(test, -crime)

# predicting classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       23       8        2    0
##   med_low    4      15        5    0
##   med_high   1       5       20    1
##   high       0       0        1   17

We find that most of the test observations are predicted correctly. Of the 102 test observations, 75 are correctly classified (the diagonal of the table), while the remaining 27 are misclassified. The error rate of the LDA is thus about 26% (it depends on the random train/test split, and can be as low as 23% with another sampling).
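
The accuracy can be computed directly from the cross tabulation instead of counting by hand. The sketch below retypes the table printed above as a matrix (conf_mat and the other names are only for illustration) and derives the overall accuracy, the error rate and the per-class recall.

```r
# confusion matrix copied from the cross tabulation above
# (rows: correct class, columns: predicted class)
lv <- c("low", "med_low", "med_high", "high")
conf_mat <- matrix(c(23,  8,  2,  0,
                      4, 15,  5,  0,
                      1,  5, 20,  1,
                      0,  0,  1, 17),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = lv, predicted = lv))

n_total   <- sum(conf_mat)        # size of the test set
n_correct <- sum(diag(conf_mat))  # observations on the diagonal

accuracy   <- n_correct / n_total
error_rate <- 1 - accuracy

# per-class recall: share of each true class that was predicted correctly
recall <- diag(conf_mat) / rowSums(conf_mat)

round(c(accuracy = accuracy, error_rate = error_rate), 3)
round(recall, 2)
```

With the live objects, the same numbers come from tab <- table(correct_classes, lda.pred$class) followed by sum(diag(tab)) / sum(tab).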

4.2.6 K-means clustering

We reload Boston, rescale it and compute the Euclidean distances between its observations.

# reload the data
data("Boston")

# scale the data again
boston_scaled <- as.data.frame(scale(Boston))

# compute the Euclidean distance of Boston
dist_eu <- dist(boston_scaled)

# summary of dist_eu 
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

We now run the k-means algorithm:

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

The plot above shows that the total within-cluster sum of squares drops most sharply when moving from 1 to 2 clusters. Thus, a reasonable choice for the number of clusters is 2.
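
The elbow can also be read off numerically. The following is a minimal sketch under my own assumptions: it clusters the scaled data (so that large-scale variables such as “tax” do not dominate the Euclidean distances), it uses nstart = 20 random restarts to make the result less dependent on the starting points, and it draws the curve with base graphics.

```r
library(MASS)  # provides the Boston data frame

boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
k_max <- 10

# total within-cluster sum of squares for k = 1, ..., k_max;
# nstart = 20 reruns k-means from 20 random starts and keeps the best run
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 20)$tot.withinss
})

# relative drop in TWCSS when going from k - 1 to k clusters
rel_drop <- -diff(twcss) / head(twcss, -1)
names(rel_drop) <- paste0("k=", 2:k_max)
round(rel_drop, 2)

# base-graphics version of the elbow plot
plot(1:k_max, twcss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")
```

A large value of rel_drop at k=2 followed by clearly smaller values supports choosing 2 clusters. For k = 1 the TWCSS is just the total sum of squares, which for standardized data equals (506 - 1) * 14 = 7070.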

We now run k-means algorithm again, this time with 2 clusters, and plot the Boston dataset with the clusters. The clusters will be colored in red and black.

set.seed(123)

# k-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

If one zooms in on the plot above, one can see that “rad” has nicely separated clusters across all of the possible pairings. “tax” also has good separation of clusters. For the other variables the clusters overlap heavily, and no other conclusion can be drawn.

4.2.7 Bonus: K-means algorithm and LDA

We will now run the k-means algorithm on the original Boston data (after scaling). We choose 5 clusters. We then perform LDA using the clusters as target classes. We will include all the variables in the Boston data in the LDA model.

set.seed(5)
# reload the data
data("Boston")

# scale the data again
boston_scaled <- as.data.frame(scale(Boston))

# k-means clustering with 5 clusters
km <- kmeans(boston_scaled, centers = 5)

# linear discriminant analysis on the clusters, with data=boston_scaled, and target variable km$cluster
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results. Note that lda.arrows is the same function we have used above
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?

We observe in the above biplot that “tax” and “rad” have the longest arrows, so they are the most influential linear separators for the clusters. Moreover, k-means seems to form well-separated clusters.

4.2.8 Super-Bonus: A cool 3D plot of LDA and K-means

We will recall the code for the (scaled) train data that we used to fit the LDA. We then create a matrix product, which is a projection of the data points.

set.seed(123)
# LDA

lda.fit <- lda(crime ~ ., data = train)

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
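
To see what this matrix product is, one can compare it with the discriminant scores returned by predict(). The sketch below rebuilds the train set from scratch (the seed 123 before sample() is my own arbitrary choice, so the rows differ from the ones used above) and shows that the two differ only by a constant shift per column, because predict() centres the predictors before projecting them.

```r
library(MASS)  # provides Boston and lda()

# rebuild the scaled data with the categorical crime variable
boston_scaled <- as.data.frame(scale(Boston))
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

set.seed(123)  # arbitrary seed for this sketch
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

lda.fit <- lda(crime ~ ., data = train)
model_predictors <- train[, names(train) != "crime"]

# manual projection: (n x 13) %*% (13 x 3) gives an n x 3 matrix of LD scores
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling

# predict() centres the predictors before projecting, so its scores differ
# from the raw matrix product only by a constant shift in each column
ld_scores <- predict(lda.fit)$x
shift <- matrix_product - ld_scores
shift_range <- apply(shift, 2, function(col) diff(range(col)))
shift_range
```

All entries of shift_range should be numerically zero, so the 3D plot of the matrix product is the usual LDA plot moved to a different origin.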

We now create a 3D plot of the columns of the matrix product:

library(plotly)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~train$crime)

Now let’s run the k-means algorithm on the scaled training predictors (model_predictors) with 4 clusters (since crime has 4 classes), and draw another 3D plot of the matrix product where the color is defined by the k-means clusters.

set.seed(5)
km = kmeans(model_predictors, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~factor(km$cluster))

The k-means clustering is mostly successful. One can see that there are 2 superclusters, while clusters 1, 2 and 4 (mostly) form their own subclusters under one of the superclusters. Cluster 3 is shared between the two superclusters. In the plot colored by “crime”, “med_high” has this same property, while the other classes are nicely separated into the two superclusters. Thus, the k-means clustering with 4 clusters seems to give results similar to the lda.fit of the “crime” variable.
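
The visual comparison can be backed up with a cross tabulation of the k-means clusters against the crime classes. This is only a sketch: the train set is rebuilt with an arbitrary seed (123) of my own choosing, so the counts will not match the plots above exactly, and "purity" is just one simple way to summarize the agreement.

```r
library(MASS)  # provides the Boston data frame

# rebuild the scaled train set with the categorical crime variable
boston_scaled <- as.data.frame(scale(Boston))
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

set.seed(123)  # arbitrary seed for this sketch
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
model_predictors <- train[, names(train) != "crime"]

set.seed(5)
km <- kmeans(model_predictors, centers = 4)

# rows: k-means cluster, columns: crime class
tab <- table(cluster = km$cluster, crime = train$crime)
tab

# purity: share of observations in the dominant crime class of their cluster
purity <- sum(apply(tab, 1, max)) / sum(tab)
round(purity, 2)
```

A purity close to 1 would mean the clusters reproduce the crime classes almost exactly, while a value near 0.25 would mean they carry no information about them.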


Assignment 5: Dimensionality reduction techniques

This week I have worked on dimensionality reduction techniques. It is getting easier and easier. It feels like using R is such a breeze now.

date()
## [1] "Tue Dec 13 06:29:33 2022"

4.1: Data wrangling

This week, the data wrangling continued from last week’s data wrangling, so I picked up where I left off. The R code of the data wrangling part is in the data folder of my GitHub repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/create_human.R

4.2 Analysis

Setting up the packages

library(dplyr)
library(tidyr)
library(tidyverse)
library(corrplot)
library(stringr)
library(GGally)
library(ggplot2)
library(plotly)
library(FactoMineR)

4.2.1: Reading the dataset

# reading the required file for the assignment
human<-read.csv( "data/human.csv",row.names = 1)

4.2.2 Plots

#message=FALSE just deletes the extra messages that appear in ggpairs

# checking out its dimension, structure and summary
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# visualize the 'human' variables
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human)%>%corrplot()

Let’s interpret the data. From the ggpairs plot, we can see that the distributions of Edu2.FM, Labo.FM, Life.Exp, Edu.Exp and Parli.F are almost symmetric with respect to their means, while the distributions of GNI, Mat.Mor and Ado.Birth are strongly right-skewed, with most of the mass at small values.

From the scatter plots and the correlation coefficients, one can see that Ado.Birth is highly correlated with almost all of the variables, except Labo.FM and Parli.F. It is positively correlated with Mat.Mor with high significance, and negatively correlated with the rest. Thus, one can say that countries with a high adolescent birth rate also have high maternal mortality, which does make sense. Parli.F is not significantly correlated with most of the variables, the main exception being a positive correlation with Labo.FM. Mat.Mor behaves similarly to Ado.Birth: it is significantly correlated with almost all of the variables, except Labo.FM (not too significant) and Parli.F (no significance at all).

Finally, from the correlation plot, one can see that Labo.FM and Parli.F are at most weakly correlated with the other variables. This means that the ratio of females to males in the labour force is not strongly correlated with any other variable, and the same holds for the percentage of female representatives in parliament. Thus, gender equality in these two respects may not have a strong effect on the other variables.

However, Edu2.FM is also a metric of gender equality. It measures the ratio of females to males with at least secondary education. Edu2.FM is highly correlated with all of the variables, except Labo.FM and Parli.F as usual. It is positively correlated with Life.Exp, Edu.Exp and GNI, while being negatively correlated with Mat.Mor and Ado.Birth. This suggests that a more gender-equal secondary education is highly correlated with a better overall society in terms of variables such as high life expectancy, high expected years of schooling, high gross national income per capita, low maternal mortality and a low adolescent birth rate.

4.2.3 PCA

Let’s start with a PCA of the non-standardized variables:

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# print out a summary of pca_human, to show the variability
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1),col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Now we standardize the human data and perform PCA again:

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method), and print out its summary
pca_human <- prcomp(human_std)
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1),col = c("grey40", "deeppink2"))

# create and print out a summary of pca_human
s <- summary(pca_human)


# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Now let’s interpret the results. Observe that the two results are completely different from each other. In the non-standardized data, GNI is almost perfectly aligned with the PC1 axis (negatively correlated), and its arrow is very long. Its variance is so high that PC1 essentially captures only GNI (PC1 accounts for 99.99% of the total variance), and the effects of the other variables are lost. This is because GNI takes values up to the order of \(10^5\), while the other variables range from about 0.1 to about \(10^3\). Thus, we definitely need to standardize the variables to see the effect of the other variables in the model.
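
The same effect is easy to reproduce on a dataset that ships with R. The sketch below uses USArrests purely as a stand-in for the human data (my own choice, since it also has one column with a much larger variance than the others) and compares the PCA with and without standardization.

```r
# USArrests ships with R; its columns live on very different scales
apply(USArrests, 2, var)

pca_raw <- prcomp(USArrests)                 # PCA on the raw variables
pca_std <- prcomp(USArrests, scale. = TRUE)  # same as prcomp(scale(USArrests))

# proportion of variance captured by each component
prop_raw <- summary(pca_raw)$importance["Proportion of Variance", ]
prop_std <- summary(pca_std)$importance["Proportion of Variance", ]
round(rbind(raw = prop_raw, standardized = prop_std), 3)

# variable with the largest absolute loading on PC1 in the raw PCA
top_raw <- names(which.max(abs(pca_raw$rotation[, "PC1"])))
top_raw
```

In the raw PCA the first component is dominated by the variable with the largest variance (here Assault), just as GNI dominates PC1 in the non-standardized human data. After standardization the variance is spread over several components.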

Now, in the standardized data we can finally see the actual correlations of the other variables. Besides GNI, we have Edu.Exp, Edu2.FM and Life.Exp all negatively correlated with PC1, while Ado.Birth and Mat.Mor are positively correlated with PC1. We also see Labo.FM and Parli.F mostly positively correlated with PC2. We can also confirm that GNI, Edu.Exp, Edu2.FM and Life.Exp are highly positively correlated with each other, and highly negatively correlated with Ado.Birth and Mat.Mor. Labo.FM and Parli.F are mostly positively correlated with each other. This confirms our earlier observation. A better quality of life within a country (which corresponds to low PC1) goes together with higher gross national income per capita, high expected years of schooling, better gender equality in terms of a high ratio of females to males with at least secondary education, and high life expectancy. We also have a correlation between better quality of life and low maternal mortality and a low adolescent birth rate. But we have almost no correlation between better quality of life and gender equality in terms of the ratio of females to males in the labour force and the percentage of female representatives in parliament. All of this agrees with the previous correlation plot analysis.

4.2.4 Interpreting the PCs

As we have said, it is apparent that PC1 is related to the development of a country in terms of overall quality of life. As we have found out, GNI, Edu.Exp, Edu2.FM and Life.Exp are negatively correlated with PC1, while Ado.Birth and Mat.Mor are positively correlated with PC1. This suggests that low PC1 indicates a more developed country, while a higher PC1 indicates a less developed country. Note that Edu2.FM is actually related to gender equality in education, which suggests that a more gender-equal education goes together with a more developed country. The same holds for Ado.Birth and Mat.Mor, since a lower adolescent birth rate and lower maternal mortality directly imply that women have better life standards. Thus from our observation, we can see that higher Ado.Birth and Mat.Mor go together with higher PC1, which is related to how much less developed a country is, as we expected from our correlation plot analysis.
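
One caveat about the direction: the sign of a principal component is arbitrary, so "low PC1 means more developed" is a property of this particular fit and not of the data. The sketch below, again with USArrests as a stand-in dataset (my own choice), flips the sign of PC1 and shows that nothing substantial changes.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# flip the sign of PC1 in both the loadings and the scores
rotation_flipped <- pca$rotation
scores_flipped <- pca$x
rotation_flipped[, "PC1"] <- -rotation_flipped[, "PC1"]
scores_flipped[, "PC1"] <- -scores_flipped[, "PC1"]

# both versions reconstruct the same standardized data
recon_original <- pca$x %*% t(pca$rotation)
recon_flipped <- scores_flipped %*% t(rotation_flipped)
all.equal(recon_original, recon_flipped)

# the variance captured by each component is untouched by the flip
all.equal(apply(pca$x, 2, var), apply(scores_flipped, 2, var))
```

So what matters in the biplot is which arrows point in the same direction and which point in opposite directions, not whether a group of variables sits on the left or the right.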

We moreover observe that PC2 is somewhat related to other factors of gender equality. Interestingly, other than Edu2.FM, we have Labo.FM and Parli.F, which seemingly do not impact the overall quality of life of the people. This suggests that PC2 is actually related to a more philosophical variable: gender equality in day-to-day life, such as equality in the labour force or equality in parliament. Our analysis suggests that these variables are not correlated with GNI or the other variables which indicate a developed country. In fact, Labo.FM and Parli.F tend to have almost no effect on these variables. This confirms our preliminary correlation plot analysis: Labo.FM and Parli.F are at most weakly correlated with the other variables.

4.2.5 Tea time

We have the following tea data:

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

#look at the structure and dimension of tea
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# for viewing the tea data
# View(tea)

There are 300 observations (individual people) and 36 variables. Briefly, the tea dataset describes how these 300 people drink tea (18 questions) and how they perceive the product (12 questions). There are also 4 personal questions, such as age, sex, occupation and age group.

4.2.6 MCA on tea

We will use Multiple Correspondence Analysis (MCA) on the tea data. We choose the same columns as in the Exercise set 5, namely “Tea”, “How”, “how”, “sugar”, “where”, “lunch”.
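
Before looking at the MCA summary, it helps to know how many dimensions to expect. For an MCA on the indicator matrix of K categorical variables with J categories in total, there are at most J - K non-trivial dimensions and the total inertia is J / K - 1. The sketch below does this arithmetic with the category counts reported by str(tea_time) and the eigenvalue of the first dimension from the summary (0.279); all variable names are only for illustration.

```r
# number of categories per variable, as reported by str(tea_time)
n_levels <- c(Tea = 3, How = 4, how = 3, sugar = 2, where = 3, lunch = 2)

n_vars <- length(n_levels)   # K = 6 variables
n_cats <- sum(n_levels)      # J = 17 categories in total

# an MCA of the indicator matrix has at most J - K non-trivial dimensions
n_dims <- n_cats - n_vars

# total inertia of the indicator-matrix MCA is (J - K) / K = J / K - 1
total_inertia <- n_dims / n_vars

# share of inertia for a dimension with eigenvalue 0.279 (Dim.1 in the summary)
share_dim1 <- 100 * 0.279 / total_inertia

c(dimensions = n_dims, total_inertia = round(total_inertia, 3),
  share_dim1 = round(share_dim1, 1))
```

This gives 11 dimensions, which matches Dim.1 to Dim.11 in the summary, and about 15.2% of the inertia for the first dimension. It also explains why the percentages per dimension in MCA look low compared to PCA: the inertia is spread over many dimensions by construction.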

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

We now interpret the results. Recall that MCA is a data analysis technique for nominal categorical data (i.e. factor variables), used to detect and represent underlying structures or patterns in a data set. It does this by representing the data as points in a low-dimensional Euclidean space. In our case, as one can observe from the summary, MCA generated an 11-dimensional space (corresponding to 11 eigenvalues). The 1st and the 2nd dimensions retain the largest shares of the total variance, about 15.24% and 14.23% respectively, so together they capture only about 29.47% of it.
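As a rough check of where the 11 dimensions and the percentages come from, here is a minimal sketch in base R. The eigenvalues are copied (rounded) from the summary above. The numbers of levels per variable are my reading of the output and the plots, so treat them as an assumption; the variable names `levels_per_variable`, `eig`, `pct` and `cum_pct` are only for illustration.

```r
# Levels per variable, as read from the MCA output and the bar plots
# (assumption: e.g. Tea has black / Earl Grey / green; sugar and lunch have 2 levels each)
levels_per_variable <- c(Tea = 3, How = 4, how = 3, sugar = 2, where = 3, lunch = 2)

# MCA yields (number of categories - number of variables) dimensions
n_dims <- sum(levels_per_variable) - length(levels_per_variable)

# Eigenvalues copied (rounded to 3 decimals) from summary(mca)
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156, 0.144, 0.141, 0.117, 0.087, 0.062)

# Share of the total variance per dimension, and the running total
pct <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)

print(n_dims)
print(round(pct[1:2], 1))
print(round(cum_pct[2], 1))
```

Because the eigenvalues are rounded, the percentages agree with the summary only up to about one decimal place.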

The first plot shows the number of occurrences of each answer within each categorical variable. Observe that almost everyone drinks tea outside of lunch, and almost everyone drinks tea with no additives. There is an almost 50-50 divide between sugar and no sugar. People also seem to prefer tea bags, bought from a chain store. Earl Grey seems to be the most popular type by far.

The second plot visualizes the MCA. It shows how the categories of the different variables relate to each other. On the plot, each color represents one of our 6 variables. Some intriguing patterns emerge from the MCA plot:

  1. People seem to buy unpackaged tea from a tea shop
  2. People that buy tea from a chain store prefer to drink it not during lunch
  3. People prefer to drink Earl Grey tea with milk and sugar
  4. People prefer to drink black tea with no sugar.
  5. People seem to use teabags as a preferred method for storing and preparing tea, while also putting no additives in the tea.
  6. Finally, people who prefer to buy their tea from both chain stores and tea shops, also prefer to buy their tea in teabags or unpackaged. These seem to be the most indecisive or generic people :)

Assignment 6: Analysis of longitudinal data

In this final week, I have worked on analysis of longitudinal data. I think throughout these 6 weeks, I grew accustomed to R as much as I can for the time being. For now, I will take a break from R and focus on other tasks :) It was a perfect course for me, even though I feel like I may not have learned what I should have learned.

date()
## [1] "Tue Dec 13 06:29:50 2022"

4.1: Data wrangling

The data wrangling part of this week was a short but important task. The R code for it is in the data folder of my GitHub repository. I will put the link here as well: https://github.com/bbayraktaroglu/IODS-project/blob/master/data/meet_and_repeat.R

4.2: Analysis

Setting up the packages

library(tidyverse)
library(dplyr)
library(ggplot2)
library(tidyr)
library(lme4)

4.2.1: Reading the dataset for all of the assignment

# set working directory
setwd("~/Github/IODS-project")

# reading the required files for the assignment
RATS <- read_csv("data/RATS.csv")
## Rows: 16 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): ID, Group, WD1, WD8, WD15, WD22, WD29, WD36, WD43, WD44, WD50, WD5...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
BPRS <- read_csv("data/BPRS.csv")
## Rows: 40 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): treatment, subject, week0, week1, week2, week3, week4, week5, week...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We now convert both data sets to long format:

# convert the categorical variables of both data sets to factors

BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

# convert the data sets to long form, add a week variable to BPRS and a time variable to RATS

BPRSL <-  pivot_longer(BPRS, cols=-c(treatment,subject),names_to = "weeks",values_to = "bprs") %>% arrange(weeks)
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))
rm(BPRS)

RATSL <- pivot_longer(RATS, cols=-c(ID,Group), names_to = "WD",values_to = "Weight")  %>%  mutate(Time = as.integer(substr(WD,3,4))) %>% arrange(Time)

4.2.2: Part 1: RATS

We start with the longitudinal analysis for the dataset ‘RATS’.

# checking the columns of the long data
colnames(RATSL)
## [1] "ID"     "Group"  "WD"     "Weight" "Time"
# dimensions of the long data
dim(RATSL)
## [1] 176   5
# structure of the long data
str(RATSL)
## tibble [176 × 5] (S3: tbl_df/tbl/data.frame)
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: num [1:176] 240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
# summaries of the long data
summary(RATSL)
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

The ‘RATS’ data come from a nutritional study on three groups of rats (16 rats in total), where each group was put on a different diet. Over a period of about nine weeks (days 1 to 64), their weights (in grams) were recorded repeatedly. The aim was to see how the different diets affect the weight of the rats.

We see that there are 176 observations and 5 variables in the long format data. The variables are:

  • ID (identification of the rat; factor variable between 1-16)
  • Group (group of the rat; factor variable between 1-3)
  • WD (which day the measurement took place; character with 11 different values)
  • Weight (weight of the rat in grams; numeric)
  • Time (day of the measurement; integer with 11 different values)

4.2.2.1 Plots

# Plot the RATSL data
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight))) 

Observe that the rats in Group 1 and Group 3 have quite similar weights within each group, while there is an obvious outlier in Group 2. On average, the Group 1 rats have the lowest weight, while the Group 3 rats have the highest weight. The outlier rat in Group 2 is the heaviest of all the rats. Moreover, as time passes, the weight of the individual rats increases overall.

4.2.2.2 Standardization of variables

# standardize the variable
RATSL<-RATSL %>%
  group_by(Group) %>%
  mutate(stdWeight = (Weight-mean(Weight))/sd(Weight)) %>%
  ungroup()
# Glimpse the long-format BPRS data (BPRSL, not the RATSL data standardized above)
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Plot again with the standardised RATSL
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized RATS weight")

Now, after standardizing the RATS weights within each group, we can see the changes over time more explicitly and compare the profiles of the different groups more clearly. Of course, we need further analysis to come up with an interpretation of what is going on.
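To make the within-group standardization concrete, here is a small base R sketch of the same idea as the `group_by(Group)` and `mutate()` step. It uses only the first ten Time 1 weights and their groups as shown in the `str(RATSL)` output; the names `weight`, `group` and `std_weight` are hypothetical and only for illustration.

```r
# First ten Time-1 weights and their groups, copied from the str(RATSL) output
weight <- c(240, 225, 245, 260, 255, 260, 275, 245, 410, 405)
group  <- factor(c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2))

# Within-group z-score: subtract the group mean, divide by the group sd
std_weight <- ave(weight, group, FUN = function(v) (v - mean(v)) / sd(v))

# After standardization, each group has mean 0 and standard deviation 1
print(tapply(std_weight, group, mean))
print(tapply(std_weight, group, sd))
```

Since every group is centred on its own mean, the standardized values no longer show the differences in level between the groups, only the deviations within each group.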

4.2.2.3 Summary graph

# Summary data with mean and standard error of Weight by Group and Time
# (n() is the number of rats in each Group-Time cell: 8, 4 and 4)
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Now, we can see in the plot above that there is an overall (on average) increase in weight within each group over time. Group 2 shows the largest increase in weight when we compare the initial and the final weights, while Group 1 shows the smallest increase.
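The error bars in the summary graph show standard errors, i.e. the standard deviation divided by the square root of the number of observations behind each mean. The following sketch, in base R, works out how many rats there are in each group from the `summary(RATSL)` output, and how much the divisor matters. The value `sd_example` is a made-up number used purely for illustration.

```r
# Rows per group in the long data (from summary(RATSL)) and number of time points
rows_per_group <- c(`1` = 88, `2` = 44, `3` = 44)
n_times <- 11

# Number of rats in each group
rats_per_group <- rows_per_group / n_times

# For a fixed standard deviation, compare dividing by sqrt(16) (all rats)
# with dividing by the square root of the actual group size
sd_example  <- 10  # hypothetical value, for illustration only
se_all_rats <- sd_example / sqrt(16)
se_by_group <- sd_example / sqrt(rats_per_group)

print(rats_per_group)
print(se_by_group / se_all_rats)
```

With only 4 rats in Groups 2 and 3, dividing by the square root of 16 would give error bars that are half as long as those based on the group size.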

4.2.2.4 Finding the outlier(s)

# Create a summary data by Group and ID with mean as the summary variable. Note that Time starts at 1, so the filter Time > 0 keeps every measurement, including the baseline at Time 1.
RATSL8S <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus Group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), Time 1-64")

We see from the boxplot above that each group has a single outlier. Group 2 has an outlier well above the mean of its other data points, while Groups 1 and 3 each have an outlier below the mean of their data points. Group 1 seems to have a symmetric distribution, while Group 2 has a highly skewed distribution, with its longer tail extending below its mean (i.e. it is left skewed). Note that its median lies towards the longer tail. Group 3 also shows a little skewness, with its longer tail towards values higher than its mean.

4.2.2.5 Removing the outlier(s)

# Create a new data by filtering the outlier and adjust the ggplot code, then draw the plot again with the new data
RATSL8S1 <- filter(RATSL8S, (Group==1 & mean>250) | (Group==2 & mean < 590) | (Group==3 & mean>500))
# note how we ignore the corresponding outliers for each group

ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), Time 1-64")

We can see from the boxplots above that we have successfully removed the outliers. Note that the distribution within each group has changed (for some groups considerably), as can be seen in the new boxplots. In particular, Group 2 has lost most of its skewness and Group 3 has lost all of its skewness. Group 1 has gained some skewness towards high values, but this gain is too small to be of importance.

4.2.2.6 Anova test

Note that we have 3 groups, so we cannot use a two-sample t-test for the RATS data (a t-test compares exactly 2 groups, as in the BPRS data). Instead, we use an analysis of variance (ANOVA).

# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATS$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline+ Group, data = RATSL8S2)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the baseline value is highly significant, with a p-value of about \(5.2 \times 10^{-15}\). This implies that the initial weight of a rat is strongly related to its mean weight over the study period. The Group variable, however, is not significant at the 5% level, with a p-value of about \(0.07586\), which is greater than \(0.05\). This implies that we cannot reject the null hypothesis, which in this case states that the groups do not differ in mean weight once the baseline is taken into account. In other words, the data do not provide sufficient evidence that the different diets have different effects on the weight of the rats.
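To see how the F value and p-value for Group in the table above fit together, here is a short base R sketch. All numbers are copied from the `anova(fit)` output (the sums of squares are rounded there, so the recomputed F value is only approximate); the variable names are hypothetical.

```r
# Values copied from the anova(fit) table
ss_group <- 726;  df_group <- 2
ss_resid <- 1352; df_resid <- 12

# F statistic: ratio of the Group mean square to the residual mean square
f_from_table <- (ss_group / df_group) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution for the reported F value
p_group <- pf(3.2219, df1 = df_group, df2 = df_resid, lower.tail = FALSE)

print(f_from_table)
print(p_group)
```

The p-value is the probability of seeing an F value at least this large if the groups did not differ at all, which is why a value above 0.05 leaves the null hypothesis standing.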

4.2.3: Part 2: BPRS

We continue with the longitudinal analysis for the dataset ‘BPRS’. Note that we have already loaded the data, and made it into long format, which was saved as ‘BPRSL’.

# checking the columns of the long data
colnames(BPRSL)
## [1] "treatment" "subject"   "weeks"     "bprs"      "week"
# dimensions of the long data
dim(BPRSL)
## [1] 360   5
# structure of the long data
str(BPRSL)
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr [1:360] "week0" "week0" "week0" "week0" ...
##  $ bprs     : num [1:360] 42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
# summaries of the long data
summary(BPRSL)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252

The ‘BPRS’ data was obtained from 40 male subjects who were randomly assigned to one of two separate treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses several symptoms such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

The long format data ‘BPRSL’ contains 360 observations and 5 variables. The variables are:

  • treatment (type of treatment given to the subject; factor variable between 1-2)
  • subject (identification of the subject; factor variable between 1-20)
  • weeks (which week the measurement took place; character with 9 different values)
  • bprs (BPRS score; numeric)
  • week (week of the measurement; integer with values in 0-8)

4.2.3.1 Plots

# Plot the BPRSL data, note that linetype gave errors since we do not have 'continuous lines', so we used col here. But this is fixed below.

ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(col = treatment))+
  scale_y_continuous(name = "BPRS")+
  theme(legend.position = "top")

Our naive plot above is a little bit messy. We notice that we can give the subjects of treatment 1 and of treatment 2 separate labels, instead of reusing the same subject labels (1-20) in both treatments. This can be done quite easily:

# Mutating the subject variable. We identify the treatment 1 subjects to be from 1-20, and treatment 2 subjects to be from 21-40. 

BPRSL$subject <- as.numeric(BPRSL$subject)
BPRSL <- mutate(BPRSL, subject = ifelse(treatment == "2", subject+20, subject))
BPRSL$subject <- factor(BPRSL$subject)

# New plot. We can use col here if we want instead of linetype, it does not matter
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype = treatment))+ 
  scale_y_continuous(name = "BPRS")+
  theme(legend.position = "top")

There is an overall decreasing trend in the BPRS variable. Treatment 2 seems to have a lot of variance towards week 8, while treatment 1 does not have that much variance.
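A side note on the relabelling of `subject` above: `as.numeric()` applied to a factor returns the internal level codes, not the labels. In our data this is harmless, because the levels are "1" to "20" in that order, so codes and labels coincide. The toy example below (with a made-up factor `f`) sketches what can go wrong otherwise.

```r
# A factor whose labels are numbers but not 1, 2, 3, ...
f <- factor(c(10, 2, 30))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(f)

# Converting via character recovers the original numbers
labels_as_numbers <- as.numeric(as.character(f))

print(codes)
print(labels_as_numbers)
```

Going through `as.character()` first is the safer habit whenever the factor labels are not simply 1, 2, 3, and so on.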

4.2.3.2 Regression analysis

Here we assume independence of measurements in BPRS throughout several weeks to create a regression model.

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs~ week + treatment, data=BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The output of the regression model indicates that the “week” variable is highly significant, with a p-value less than \(2 \times 10^{-16}\). The estimate is about \(-2.27\), which implies that BPRS is expected to decrease by about \(2.27\) points per week. This supports our earlier observation that BPRS decreased as the weeks went on.

But the treatment variable is not significant at all, with a p-value of about \(0.661\). This implies that we cannot reject the null hypothesis of no treatment difference: there is no significant evidence to support the claim that the two types of treatments affect BPRS in a different manner. Both multiple and adjusted R-squared are also quite low (about \(0.18\)), which implies that the model explains only a small part of the variation in BPRS.
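For the treatment coefficient, the t-value and p-value in the summary can be recomputed by hand. This is a minimal base R sketch using the estimate, standard error and residual degrees of freedom copied from the output above; the variable names are hypothetical.

```r
# Values copied from summary(BPRS_reg) for the treatment2 coefficient
estimate  <- 0.5722
std_error <- 1.3034
df_resid  <- 357

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 3))
print(round(p_value, 3))
```

An estimate that is less than half of its standard error is well within what we would expect from noise alone, hence the large p-value.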

If we repeat the analysis with just the week variable as the explanatory variable:

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs~ week, data=BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.740  -8.740  -3.388   6.889  50.530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.7400     1.2003  38.940   <2e-16 ***
## week         -2.2704     0.2521  -9.005   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.35 on 358 degrees of freedom
## Multiple R-squared:  0.1847, Adjusted R-squared:  0.1824 
## F-statistic:  81.1 on 1 and 358 DF,  p-value: < 2.2e-16

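We see that the adjusted R-squared value has increased slightly (from 0.1806 to 0.1824), which suggests that dropping the treatment variable costs nothing in explanatory power. The value is still quite low, however: week alone explains only about 18% of the variance in BPRS.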

4.2.3.3 Random intercept model

Now we do not assume independence of BPRS measurements.

# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

The random intercept model allows the linear regression fit for each subject to differ in intercept from the other subjects. We also forgo the independence assumption. Observe that the estimated standard deviation of the subject-specific intercepts is about \(9.869\), which is larger than the residual standard deviation of about \(7.364\). This implies that the intercept varies quite a lot from subject to subject. The estimates of the “week” and “treatment” variables are exactly the same as in the regression model, but their standard errors, and hence the t-values, have changed: the standard error of “week” has decreased (from 0.2524 to 0.1503), while that of “treatment” has increased (from 1.3034 to 3.2159).
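One way to quantify how much the subjects differ is the intraclass correlation, i.e. the share of the total variance that is due to differences between subjects. Here is a short sketch in base R with the two variance components copied from the random effects table above; the variable names are hypothetical.

```r
# Variance components copied from summary(BPRS_ref)
var_subject  <- 97.39   # between-subject (random intercept) variance
var_residual <- 54.23   # within-subject (residual) variance

# Intraclass correlation: between-subject share of the total variance
icc <- var_subject / (var_subject + var_residual)

print(round(icc, 3))
```

A value well above one half suggests that repeated measurements on the same subject are strongly correlated, which is why the independence assumption of the ordinary regression model is doubtful here.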

4.2.3.4 Random intercept and random slope model

We now create a random intercept and random slope model.

# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000

Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept and in slope. Thus, one can account for the differences in each subject’s change profile throughout the weeks.
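Written out, the model described above can be sketched as follows. The notation is my own (it does not come from the lme4 output): \(i\) indexes the subject, \(j\) the measurement occasion, \(u_{0i}\) and \(u_{1i}\) are the subject-specific deviations in intercept and slope (assumed jointly normal with mean zero), and \(\varepsilon_{ij}\) is the residual.

```latex
\[
\text{bprs}_{ij} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\,\text{week}_{j}
  + \beta_2\,\text{treatment}_{i} + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
\]
```

Setting \(u_{1i} = 0\) for all subjects gives back the random intercept model. In the output above, the estimated standard deviations of \(u_{0i}\) and \(u_{1i}\) are about \(12.96\) and \(1.53\), their correlation is about \(-0.67\), and \(\sigma\) is about \(6.06\).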

Here, we see that the estimate for the “week” variable is the same as in the previous model (\(-2.2704\)), while the estimate for the “treatment” variable has increased by a factor of almost \(3\) (from \(0.5722\) to \(1.5139\)). However, its standard error is about \(3.14\), giving a t-value of only \(0.482\), so there is still no clear evidence that the choice of treatment has an impact on the BPRS result, and we cannot say which treatment works best.

4.2.3.5 Anova test

We now run an anova test, which here is a likelihood ratio test comparing the fits of the two models above:

# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRS_ref     5 2582.9 2602.3 -1286.5   2572.9                         
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

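We see that the p-value is very small (about \(1.5 \times 10^{-14}\), from a chi-squared statistic of \(63.663\) on \(2\) degrees of freedom). One can conclude that “BPRS_ref1”, which is the random intercept and random slope model, gives a better fit of our data.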
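The numbers in the comparison table above can be reproduced by hand. The sketch below, in base R, uses the deviances and parameter counts copied from the table (the deviances are rounded there, so the recomputed statistic is approximate); the variable names are hypothetical.

```r
# Values copied from the anova() table
deviance_intercept <- 2572.9  # random intercept model (BPRS_ref)
deviance_slope     <- 2509.2  # random intercept and slope model (BPRS_ref1)
npar_intercept     <- 5
npar_slope         <- 7

# Likelihood ratio statistic: drop in deviance between the nested models
chisq_from_deviance <- deviance_intercept - deviance_slope

# Degrees of freedom: number of extra parameters in the larger model
df_diff <- npar_slope - npar_intercept

# Upper-tail chi-squared probability for the reported statistic
p_value <- pchisq(63.663, df = df_diff, lower.tail = FALSE)

print(chisq_from_deviance)
print(df_diff)
print(p_value)
```

The two extra parameters are the variance of the random slopes and their correlation with the random intercepts.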

4.2.3.6 Model with interaction

# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + week*treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## week             -2.6283     0.3752  -7.006
## treatment2       -2.2911     4.2200  -0.543
## week:treatment2   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1    7 2523.2 2550.4 -1254.6   2509.2                    
## BPRS_ref2    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

As in Exercise Set 6, we have added an interaction of the form “week x treatment” to the random intercept and random slope model. We compared the interaction model with the previous model using an anova (likelihood ratio) test. The p-value is about \(0.1821\), which is larger than \(0.1\), so there is no strong indication that the interaction model fits the data better. AIC and BIC are also slightly lower for the previous model. One can conclude that the simpler model without the interaction is preferable.
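To read the interaction model more concretely, one can turn its fixed-effect estimates into a weekly slope for each treatment and into population-average predictions. This is a minimal base R sketch using the estimates copied from `summary(BPRS_ref2)`; the helper `predict_mean_bprs()` is a hypothetical function written for this illustration and ignores the random effects.

```r
# Fixed-effect estimates copied from summary(BPRS_ref2)
b_intercept   <- 47.8856
b_week        <- -2.6283
b_treatment2  <- -2.2911
b_interaction <- 0.7158

# Weekly change in the population-average BPRS for each treatment
slope_t1 <- b_week
slope_t2 <- b_week + b_interaction

# Population-average prediction for a given week and treatment (1 or 2)
predict_mean_bprs <- function(week, treatment) {
  is_t2 <- as.numeric(treatment == 2)
  b_intercept + b_week * week + b_treatment2 * is_t2 + b_interaction * week * is_t2
}

print(c(slope_t1 = slope_t1, slope_t2 = slope_t2))
print(predict_mean_bprs(week = c(0, 8), treatment = 1))
print(predict_mean_bprs(week = c(0, 8), treatment = 2))
```

The estimated decline is somewhat slower under treatment 2, but since the interaction is not significant, this difference should not be over-interpreted.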

4.2.3.7 Plotting the fitted values for the best fit

We now plot the fitted values for the random intercept and random slope model, which was the best possible fit obtained from our analysis.

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref1)


# Add the fitted values to BPRSL as a new column named Fitted
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)

# draw the plot of BPRSL with the Fitted values of BPRS
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_line(aes(linetype = treatment))+
  scale_y_continuous(name = "Fitted BPRS")+
  theme(legend.position = "top")

One can see that overall, there is a decreasing trend in the fitted BPRS for almost all of the subjects in both treatment groups. This suggests that the symptoms measured by the BPRS mostly decrease under both treatments. But we still cannot see which treatment is more effective.

(This is the end of the course!)